glm…not GLM

The general linear model is one of the foundations of statistical inference.

The most important thing that you need to know is that it comprises three major techniques:

Regression

t-tests

ANOVA

Assumptions

As with nearly all statistical tests, the general linear model has some assumptions built into it for proper inference.

x = rnorm(1000)

y = x + rnorm(1000)

xyDat = data.frame(x = x, y = y)

plot(x, y) # linearity: the relationship should look roughly like a straight line

testMod = lm(y ~ x)

hist(testMod$residuals) # normality of errors: the residuals should look roughly normal

plot(rnorm(100), rnorm(100)) # for reference: what pure noise looks like

plot(testMod$fitted.values, testMod$residuals) # equal variance: no fanning pattern

Assumptions are very important. They are not, however, statistical death sentences.

Assumptions According To Gelman

Andrew Gelman has proposed the following assumptions for the modern linear regression:

  1. Validity

  2. Linearity

  3. Independence of errors

  4. Equal variance of errors

  5. Normality of errors

Modern Approaches

Speaking of modern, let’s get a few things out of the way:

We won’t be transforming variables for the sake of it!

We won’t be dropping outliers!

Regression

Do You Know…OLS

Linear regression is likely the most important statistical technique.

Let’s look at the following plot:

library(ggplot2)

ggplot(xyDat, aes(x, y)) +
  geom_point(alpha = .75) +
  theme_minimal()

To perform our regression analysis, we need to fit a line through the data that satisfies the following conditions:

  1. It has to pass through the mean of x.

  2. It has to pass through the mean of y.

  3. Its slope needs to minimize the sum of squared vertical distances between the line and the observations.
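Those conditions pin the line down completely. A minimal sketch with simulated data (not the xyDat used below) showing the closed-form bivariate solution:

```r
# OLS in closed form: slope = cov(x, y) / var(x),
# and the line passes through (mean(x), mean(y))
set.seed(1001)
x = rnorm(200)
y = x + rnorm(200)

b1 = cov(x, y) / var(x)
b0 = mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)
coef(lm(y ~ x))  # matches the closed-form values
```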

Here is a plot with all of those points marked.

library(dplyr)

datSummary = xyDat %>% 
  summarize_all(mean)

xyMod = lm(y ~ x, data = xyDat)

# endpoints for the residual segments drawn below:
# note that 'fit' holds the x values and 'resid' holds the fitted values
xyModDat = data.frame(resid = xyMod$fitted.values, 
                      fit = xyMod$model$x)

ggplot(xyDat, aes(x, y)) +
  geom_point(alpha = .75) +
  geom_point(data = datSummary, mapping = aes(x, y), color = "#ff5501", size = 5) +
  geom_hline(yintercept = datSummary$y, linetype = "dashed", color = "#ff5501") +
  geom_vline(xintercept = datSummary$x, linetype = "dashed", color = "#ff5501") +
  theme_minimal()

With those identified, we can fit a line through those points.

ggplot(xyDat, aes(x, y)) +
  geom_point(alpha = .75) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(data = datSummary, mapping = aes(x, y), color = "#ff5501", size = 5) +
  geom_hline(yintercept = datSummary$y, linetype = "dashed", color = "#ff5501") +
  geom_vline(xintercept = datSummary$x, linetype = "dashed", color = "#ff5501") +
  theme_minimal()

Points and Lines

Now that we have these points and lines, what can we make of them? The goal of our regression model is to account for the variation in the dependent variable with the predictor variable. To that end, we have three different types of variation in our model:

  1. Total variation in y

  2. Explained variation in y

  3. Residual (unexplained) variation in y

variancePlot = ggplot(xyDat, aes(x, y)) +
  geom_point(alpha = .75) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(data = datSummary, mapping = aes(x, y), color = "#ff5501", size = 5) +
  geom_hline(yintercept = datSummary$y, linetype = "dashed", color = "#ff5501") +
  geom_vline(xintercept = datSummary$x, linetype = "dashed", color = "#ff5501") +
  geom_segment(data = xyModDat, mapping = aes(xend = fit, yend = resid), linetype = "dashed") +
  theme_minimal()


plotly::ggplotly(variancePlot)

Important Terms

There are a few important terms to keep in mind:

Goodness Of Fit

\(R^2\)

The proportion of total variation in y that the model explains.

Sums of squares play a very important role in all of this.

\[\Sigma(y_i-\bar{y})^2 = \Sigma(\hat{y}_i-\bar{y})^2 + \Sigma(y_i - \hat{y}_i)^2 \]

If we divide the explained sum of squares (\(\Sigma(\hat{y}_i-\bar{y})^2\)) by the total sum of squares (\(\Sigma(y_i-\bar{y})^2\)), we get the \(R^2\):

\[R^2 = \frac {\Sigma(\hat{y}_i-\bar{y})^2}{\Sigma(y_i-\bar{y})^2}\]

It can alternatively be expressed as:

\(R^2 = 1 -\frac {\Sigma(y_i - \hat{y}_i)^2} {\Sigma(y_i-\bar{y})^2}\)

In a bivariate model, \(R^2\) is literally the squared correlation, \(r^2\), between x and y.

totalSS = sum((xyMod$model$y - mean(xyMod$model$y))^2)

explainedSS = sum((xyMod$fitted.values - mean(xyMod$model$y))^2)

residualSS = sum((xyMod$model$y - xyMod$fitted.values)^2)

r2 = explainedSS / totalSS

F-test

The F-test is used to determine whether the amount of variation explained is not due to chance. It is essentially testing whether our \(R^2\) is “significant”.

Our F is calculated as follows:

\[F = \frac {\Sigma(\hat{y}_i-\bar{y})^2/(v-1)} {\Sigma(y_i - \hat{y}_i)^2/(n-v)} \]

We already know most of this: the numerator holds the explained sum of squares and the denominator holds the residual sum of squares. Here, v is the number of estimated parameters (including the intercept) and n is the sample size; with one predictor, that gives degrees of freedom of 1 and n - 2.

fStat = (explainedSS / 1) / (residualSS / (nrow(xyDat) - 2))
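Under the null, that statistic follows an F distribution with \(v-1\) and \(n-v\) degrees of freedom, which is where the p-value comes from. A self-contained sketch with simulated data:

```r
# the model F statistic and its p-value from the F distribution
set.seed(1002)
x = rnorm(100)
y = x + rnorm(100)

aovTab = anova(lm(y ~ x))
fStat = aovTab$`F value`[1]
pf(fStat, df1 = 1, df2 = 98, lower.tail = FALSE)  # equals aovTab$`Pr(>F)`[1]
```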

t-test

We use the t-test to determine whether the slope of our line is different from zero. We are essentially testing each of our \(\beta\) coefficients for a significant slope.
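A quick sketch of that arithmetic with simulated data: every t value reported by summary() is just the estimate divided by its standard error.

```r
set.seed(1003)
x = rnorm(100)
y = 2 * x + rnorm(100)

cf = summary(lm(y ~ x))$coefficients
cf[, "Estimate"] / cf[, "Std. Error"]  # reproduces cf[, "t value"]
```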

For Real

happy = read.csv("http://www.nd.edu/~sberry5/data/happy.csv", strip.white = TRUE)

basicExample = lm(Happiness.Score ~ Economy..GDP.per.Capita., data = happy)

summary(basicExample)

Call:
lm(formula = Happiness.Score ~ Economy..GDP.per.Capita., data = happy)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.96394 -0.48777 -0.07157  0.55947  1.60705 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 3.499      0.133   26.30   <2e-16 ***
Economy..GDP.per.Capita.    2.218      0.142   15.62   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7174 on 156 degrees of freedom
Multiple R-squared:  0.6099,    Adjusted R-squared:  0.6074 
F-statistic: 243.9 on 1 and 156 DF,  p-value: < 2.2e-16

What Does This Mean For Me?

The intercept is the fitted value of Happiness.Score when all predictors are equal to 0.

The coefficient for Economy..GDP.per.Capita. says that for every one-unit increase in GDP per capita, mean Happiness increases by ~2.22 units.

Let’s add another term to our model:

twoPredictors = lm(Happiness.Score ~ Economy..GDP.per.Capita. + Generosity, data = happy)

summary(twoPredictors)

Call:
lm(formula = Happiness.Score ~ Economy..GDP.per.Capita. + Generosity, 
    data = happy)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.36245 -0.43946 -0.01671  0.54241  1.58794 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)                3.0898     0.1642  18.816  < 2e-16 ***
Economy..GDP.per.Capita.   2.2238     0.1359  16.369  < 2e-16 ***
Generosity                 1.7038     0.4323   3.941 0.000122 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6862 on 155 degrees of freedom
Multiple R-squared:  0.6454,    Adjusted R-squared:  0.6409 
F-statistic: 141.1 on 2 and 155 DF,  p-value: < 2.2e-16

With another term in the model, our coefficient for GDP has changed to 2.22

  • This is holding generosity constant.

Generosity has a coefficient of 1.70.

  • Holding GDP constant.

What do these patterns mean?

Do they make sense?

Let’s See How This Holds Up

\({happiness} = \alpha + \beta_{1} gdp + \beta_{2} generosity + \epsilon\)

twoPredictors$coefficients['(Intercept)'] + 
  twoPredictors$coefficients['Economy..GDP.per.Capita.'] * happy$Economy..GDP.per.Capita.[1] +
  twoPredictors$coefficients['Generosity'] * happy$Generosity[1]
(Intercept) 
   6.701021 
happy$Happiness.Score[1]
[1] 7.587
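That by-hand arithmetic is exactly what predict() does. A self-contained sketch with toy data (not the happy data):

```r
set.seed(1004)
toyDat = data.frame(x1 = rnorm(50), x2 = rnorm(50))
toyDat$y = 1 + 2 * toyDat$x1 + 3 * toyDat$x2 + rnorm(50)

toyMod = lm(y ~ x1 + x2, data = toyDat)

# intercept + coefficient * value, summed across predictors
byHand = toyMod$coefficients["(Intercept)"] +
  toyMod$coefficients["x1"] * toyDat$x1[1] +
  toyMod$coefficients["x2"] * toyDat$x2[1]

all.equal(unname(byHand), unname(predict(toyMod, newdata = toyDat[1, ])))  # TRUE
```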

Visual Diagnostics

Using the plot method for our linear model will give us 4 very informative plots.

par(mfrow = c(2, 2))

plot(twoPredictors)

The “Residuals vs Fitted” plot checks our linearity assumption – the residuals should appear random, with no pattern!

The “Normal Q-Q” plot is giving us an idea about the normality of our residuals – the points should hug the line.

The “Scale-Location” plot is similar to our “Residuals vs Fitted” plot, but is best for homoscedasticity detection – again, random is great!

Finally, “Residuals vs Leverage” shows us observations exhibiting a high degree of leverage on our regression.
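The quantities behind those four plots can also be pulled out directly; a sketch on simulated data:

```r
set.seed(1005)
x = rnorm(100)
y = x + rnorm(100)
m = lm(y ~ x)

head(rstandard(m))       # standardized residuals (Normal Q-Q, Scale-Location)
head(hatvalues(m))       # leverage (Residuals vs Leverage)
head(cooks.distance(m))  # influence; the dashed contours on the leverage plot
```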

A Demonstration

# predictors and response
N = 100 # sample size
k = 2   # number of desired predictors
X = matrix(rnorm(N*k), ncol=k)  
y = -.5 + .2*X[,1] + .1*X[,2] + rnorm(N, sd=.5)  # increasing N will get estimated values closer to these

dfXy = data.frame(X,y)

plot(dfXy$X1, dfXy$y) # predictor on the x-axis, response on the y-axis

Long Form

lmfuncLS = function(par, X, y){
  # arguments- par: parameters to be estimated; X: predictor matrix with intercept 
  # column, y: response
  
  # setup
  beta = par                                   # coefficients
  
  # linear predictor
  LP = X%*%beta                                # linear predictor
  mu = LP                                      # identity link
  
  # calculate and return the least squares loss
  crossprod(y - mu)
}
X = cbind(1, X)

init = c(1, rep(0, ncol(X))) # the first element (sigma2) is a placeholder and is dropped below

names(init) = c('sigma2', 'intercept','b1', 'b2')
optlmLS = optim(par = init[-1], fn = lmfuncLS, 
                X = X, y = y, control = list(reltol = 1e-8))

optlmLS$par
  intercept          b1          b2 
-0.58491379  0.36529496 -0.02432126 
modlm = lm(y~., dfXy)

summary(modlm)

Call:
lm(formula = y ~ ., data = dfXy)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.38590 -0.30280  0.03062  0.28765  1.42901 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.58489    0.05112 -11.442  < 2e-16 ***
X1           0.36529    0.05453   6.699 1.38e-09 ***
X2          -0.02430    0.05565  -0.437    0.663    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.505 on 97 degrees of freedom
Multiple R-squared:  0.3194,    Adjusted R-squared:  0.3053 
F-statistic: 22.76 on 2 and 97 DF,  p-value: 7.875e-09

QR Decomposition

QRX = qr(X)
Q = qr.Q(QRX) # Orthogonal matrix
R = qr.R(QRX) # Upper triangle
Bhat = solve(R) %*% crossprod(Q, y)
qr.coef(QRX, y)
[1] -0.58488558  0.36528559 -0.02430272

Pure Matrix Multiplication

coefs = solve(t(X)%*%X) %*% t(X)%*%y
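A self-contained check on toy data that this normal-equations solution agrees with lm():

```r
set.seed(1006)
x1 = rnorm(50)
y = 1 + 2 * x1 + rnorm(50)
X = cbind(1, x1)

# solve the normal equations directly and compare to lm()
betaHat = solve(t(X) %*% X) %*% t(X) %*% y
all.equal(as.numeric(betaHat), unname(coef(lm(y ~ x1))))  # TRUE
```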

Factor Variables

Let’s see if there might be anything interesting going on with the gender variable:

library(dplyr)

crimeScore = read.csv("http://nd.edu/~sberry5/data/crimeScore.csv")

genderScores = crimeScore %>% 
  group_by(SEX_CODE_CD) %>% 
  summarize(meanScore = mean(SSL_SCORE))

genderScores
# A tibble: 3 x 2
  SEX_CODE_CD meanScore
  <fct>           <dbl>
1 F                 283
2 M                 279
3 X                 266
factorTest = lm(SSL_SCORE ~ SEX_CODE_CD, data = crimeScore)

summary(factorTest)

Call:
lm(formula = SSL_SCORE ~ SEX_CODE_CD, data = crimeScore)

Residuals:
    Min      1Q  Median      3Q     Max 
-273.46  -37.69   10.31   42.31  221.31 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  283.4600     0.1868 1517.716   <2e-16 ***
SEX_CODE_CDM  -4.7709     0.2145  -22.246   <2e-16 ***
SEX_CODE_CDX -17.2845     7.6793   -2.251   0.0244 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.96 on 398681 degrees of freedom
Multiple R-squared:  0.001248,  Adjusted R-squared:  0.001243 
F-statistic:   249 on 2 and 398681 DF,  p-value: < 2.2e-16

These are called treatment contrasts. If you want to change the treatment, try something like the following:

factorTest2 = lm(SSL_SCORE ~ relevel(crimeScore$SEX_CODE_CD, ref = "X"), 
                 data = crimeScore)

summary(factorTest2)

Call:
lm(formula = SSL_SCORE ~ relevel(crimeScore$SEX_CODE_CD, ref = "X"), 
    data = crimeScore)

Residuals:
    Min      1Q  Median      3Q     Max 
-273.46  -37.69   10.31   42.31  221.31 

Coefficients:
                                            Estimate Std. Error t value
(Intercept)                                  266.175      7.677  34.672
relevel(crimeScore$SEX_CODE_CD, ref = "X")F   17.285      7.679   2.251
relevel(crimeScore$SEX_CODE_CD, ref = "X")M   12.514      7.678   1.630
                                            Pr(>|t|)    
(Intercept)                                   <2e-16 ***
relevel(crimeScore$SEX_CODE_CD, ref = "X")F   0.0244 *  
relevel(crimeScore$SEX_CODE_CD, ref = "X")M   0.1031    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.96 on 398681 degrees of freedom
Multiple R-squared:  0.001248,  Adjusted R-squared:  0.001243 
F-statistic:   249 on 2 and 398681 DF,  p-value: < 2.2e-16

No matter the reference category, we are allowing our intercept to differ for each level of the factor variable.
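Under the hood, R expands the factor into 0/1 indicator columns in the design matrix; a sketch with a toy factor:

```r
sex = factor(c("F", "M", "X", "M", "F"))
model.matrix(~ sex)  # an intercept column plus indicators for M and X; F is the reference
```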

Ordered Factors

library(dplyr)

summary(as.factor(crimeScore$AGE_CURR))
                    20-30        30-40        40-50        50-60 
         241       142720        98975        63217        41602 
       60-70        70-80 less than 20 
       12014         1586        38329 
crimeScore$AGE_CURR[which(crimeScore$AGE_CURR == "")] = NA

crimeScore = crimeScore %>% 
  mutate(ageRec = relevel(AGE_CURR, ref = "less than 20"), 
         ageRec = as.ordered(ageRec))

If we include an ordered factor in our model, we might get a surprising result:

orderedMod = lm(SSL_SCORE ~ ageRec, data = crimeScore)

summary(orderedMod)

Call:
lm(formula = SSL_SCORE ~ ageRec, data = crimeScore)

Residuals:
     Min       1Q   Median       3Q      Max 
-256.450  -12.743   -1.450    9.566  200.550 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  232.76793    0.07524 3093.618  < 2e-16 ***
ageRec.L    -225.85594    0.27998 -806.687  < 2e-16 ***
ageRec.Q      -4.81217    0.26525  -18.142  < 2e-16 ***
ageRec.C       1.84177    0.21288    8.652  < 2e-16 ***
ageRec^4      -3.52280    0.15723  -22.406  < 2e-16 ***
ageRec^5      -0.13370    0.11065   -1.208  0.22691    
ageRec^6      -0.23760    0.08216   -2.892  0.00383 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.66 on 398436 degrees of freedom
  (241 observations deleted due to missingness)
Multiple R-squared:  0.8958,    Adjusted R-squared:  0.8958 
F-statistic: 5.709e+05 on 6 and 398436 DF,  p-value: < 2.2e-16

What R is returning is an orthogonal polynomial contrast. We are dealing with k-1 higher-order approximations of the trends of the variable (linear, quadratic, cubic, ^4, etc.). So in our model, we are looking at the effects of each trend level on our dependent variable.
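You can inspect the orthogonal polynomial contrast matrix R builds for an ordered factor directly:

```r
# columns are the linear (.L), quadratic (.Q), and cubic (.C) trends
# for a 4-level ordered factor
contr.poly(4)
```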

Here is a better example:

(Intercept)       cut.L       cut.Q       cut.C       cut^4 
  4062.2364   -362.7254   -225.5798   -699.4965   -280.3564 

From looking at a visualization of these trends (not shown here), we can see how these different “approximations” can fit the data pretty well.
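The code behind that output is missing here; given the cut.L/cut.Q/cut.C names, it plausibly regressed price on the ordered cut factor from ggplot2’s diamonds data. A sketch under that assumption:

```r
library(ggplot2)  # for the diamonds data; diamonds$cut is an ordered factor

coef(lm(price ~ cut, data = diamonds))
```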

Converting them to numeric entails a careful theoretical examination of the question at hand and of whether the ordinal categories are evenly spaced, but it buys you the nicer, easier interpretation that comes along with a numeric predictor. Converting them to unordered factors leads us back to the treatment contrasts that we used earlier.

Interactions

twoVars = lm(SSL_SCORE ~ WEAPONS_ARR_CNT + NARCOTICS_ARR_CNT, data = crimeScore)

summary(twoVars)

Call:
lm(formula = SSL_SCORE ~ WEAPONS_ARR_CNT + NARCOTICS_ARR_CNT, 
    data = crimeScore)

Residuals:
     Min       1Q   Median       3Q      Max 
-194.221  -28.029   -2.221   26.550  187.586 

Coefficients:
                  Estimate Std. Error t value            Pr(>|t|)    
(Intercept)       291.5470     1.5197 191.850 <0.0000000000000002 ***
WEAPONS_ARR_CNT    18.0598     1.0118  17.849 <0.0000000000000002 ***
NARCOTICS_ARR_CNT   2.8072     0.2933   9.573 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 49.69 on 9037 degrees of freedom
  (389644 observations deleted due to missingness)
Multiple R-squared:  0.04265,   Adjusted R-squared:  0.04243 
F-statistic: 201.3 on 2 and 9037 DF,  p-value: < 0.00000000000000022

Let’s explore interactions (moderation to some).

intMod = lm(SSL_SCORE ~ WEAPONS_ARR_CNT * NARCOTICS_ARR_CNT, data = crimeScore)

summary(intMod)

Call:
lm(formula = SSL_SCORE ~ WEAPONS_ARR_CNT * NARCOTICS_ARR_CNT, 
    data = crimeScore)

Residuals:
     Min       1Q   Median       3Q      Max 
-194.231  -27.995   -2.231   26.532  187.532 

Coefficients:
                                  Estimate Std. Error t value
(Intercept)                       292.1106     2.2340 130.754
WEAPONS_ARR_CNT                    17.5941     1.6896  10.413
NARCOTICS_ARR_CNT                   2.5461     0.8134   3.130
WEAPONS_ARR_CNT:NARCOTICS_ARR_CNT   0.2173     0.6312   0.344
                                              Pr(>|t|)    
(Intercept)                       < 0.0000000000000002 ***
WEAPONS_ARR_CNT                   < 0.0000000000000002 ***
NARCOTICS_ARR_CNT                              0.00175 ** 
WEAPONS_ARR_CNT:NARCOTICS_ARR_CNT              0.73069    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 49.69 on 9036 degrees of freedom
  (389644 observations deleted due to missingness)
Multiple R-squared:  0.04266,   Adjusted R-squared:  0.04234 
F-statistic: 134.2 on 3 and 9036 DF,  p-value: < 0.00000000000000022

The interpretation of our main effects doesn’t really change, although each is now the slope when the other predictor is equal to 0.

Despite it not being significant, we would interpret the interaction coefficient (0.22) to mean that each additional narcotics arrest increases the slope for weapons arrests by .22 (and vice versa) – the .73 in the output is its p-value, not an effect.
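With an interaction, the slope for \(x_1\) at a given value of \(x_2\) is \(b_1 + b_3 x_2\); a sketch of that conditional-slope logic with simulated data:

```r
set.seed(1007)
x1 = rnorm(300)
x2 = rnorm(300)
y = 1 + 2 * x1 + .5 * x2 + 1.5 * x1 * x2 + rnorm(300)

b = coef(lm(y ~ x1 * x2))

b["x1"] + b["x1:x2"] * c(0, 1, 2)  # slope of x1 at x2 = 0, 1, and 2
```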

crimeScoreGender = crimeScore %>% 
  filter(SEX_CODE_CD != "X") %>%
  select(SSL_SCORE, SEX_CODE_CD, WEAPONS_ARR_CNT)

intMod2 = lm(SSL_SCORE ~ WEAPONS_ARR_CNT * SEX_CODE_CD, data = crimeScoreGender)

summary(intMod2)

Call:
lm(formula = SSL_SCORE ~ WEAPONS_ARR_CNT * SEX_CODE_CD, data = crimeScoreGender)

Residuals:
     Min       1Q   Median       3Q      Max 
-264.092  -30.093   -0.093   29.908  182.908 

Coefficients:
                             Estimate Std. Error t value
(Intercept)                   316.467      7.939  39.862
WEAPONS_ARR_CNT                -2.251      7.202  -0.313
SEX_CODE_CDM                  -16.376      8.005  -2.046
WEAPONS_ARR_CNT:SEX_CODE_CDM   19.252      7.245   2.657
                                         Pr(>|t|)    
(Intercept)                  < 0.0000000000000002 ***
WEAPONS_ARR_CNT                           0.75465    
SEX_CODE_CDM                              0.04079 *  
WEAPONS_ARR_CNT:SEX_CODE_CDM              0.00788 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53.03 on 19141 degrees of freedom
  (379482 observations deleted due to missingness)
Multiple R-squared:  0.02465,   Adjusted R-squared:  0.02449 
F-statistic: 161.2 on 3 and 19141 DF,  p-value: < 0.00000000000000022

Compared to women, each additional weapons arrest increases men’s score by 19.25 more on average.

Sometimes it helps to see what is going on with a plot

library(effects)

modEffects = effect("WEAPONS_ARR_CNT*SEX_CODE_CD", intMod2)

plot(modEffects)

T-tests

What Are They Good For

You can use a t-test to test differences between two groups.

There are two general forms of the t-test:

  • Independent

  • Paired

Our Focus

We are going to focus mostly on comparing independent samples.

Unless you are going to be doing experimental work, you will probably not need to use paired tests.

Furthermore, you probably won’t ever really need to compare a sample to the population (requires you to know \(\mu\))

Tails

Like many other tests, the t-test can be run as either a one-tailed or a two-tailed test.

Alternative hypotheses can be any one of the following:

  • \(\neq\)

  • \(>\)

  • \(<\)

What is the difference?

  • Are you specifying the direction of your hypothesis or not?

One Or Two

In all seriousness, let’s consider the following plot:

hist(rnorm(100000))

Let’s Give It A Try

t.test(crimeScoreGender$SSL_SCORE ~ crimeScoreGender$SEX_CODE_CD, 
       alternative = "two.sided")

    Welch Two Sample t-test

data:  crimeScoreGender$SSL_SCORE by crimeScoreGender$SEX_CODE_CD
t = 23.674, df = 180810, p-value < 0.00000000000000022
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 4.375940 5.165898
sample estimates:
mean in group F mean in group M 
       283.4600        278.6891 

Try it with different values for alternative and with var.equal = TRUE
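A sketch of those variations with simulated groups; var.equal = TRUE gives the classic Student’s t-test rather than the Welch default:

```r
set.seed(1008)
groupA = rnorm(100, mean = 0)
groupB = rnorm(100, mean = .5)

# one-tailed: is the mean of groupA less than the mean of groupB?
t.test(groupA, groupB, alternative = "less", var.equal = TRUE)
```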

Analysis Of Variance

ANOVA

ANOVA is a lot like a t-test, but you can have more than two groups.

Trying It Out

anovaTest = aov(SSL_SCORE ~ as.factor(ageRec), data = crimeScore, projections = TRUE)
summary(anovaTest)
                      Df     Sum Sq   Mean Sq F value              Pr(>F)
as.factor(ageRec)      6 1192519335 198753222  570899 <0.0000000000000002
Residuals         398436  138711757       348                            
                     
as.factor(ageRec) ***
Residuals            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
241 observations deleted due to missingness

We now know that differences exist, but which groups are different?

We use Tukey’s Honestly Significant Difference test:

TukeyHSD(anovaTest)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = SSL_SCORE ~ as.factor(ageRec), data = crimeScore, projections = TRUE)

$`as.factor(ageRec)`
                         diff        lwr        upr p adj
20-30-less than 20  -35.73318  -36.04965  -35.41670     0
30-40-less than 20  -79.14464  -79.47559  -78.81369     0
40-50-less than 20 -123.27299 -123.62911 -122.91686     0
50-60-less than 20 -166.15984 -166.54933 -165.77036     0
60-70-less than 20 -207.85131 -208.42651 -207.27612     0
70-80-less than 20 -254.62194 -256.03157 -253.21231     0
30-40-20-30         -43.41146  -43.63902  -43.18391     0
40-50-20-30         -87.53981  -87.80263  -87.27699     0
50-60-20-30        -130.42667 -130.73317 -130.12016     0
60-70-20-30        -172.11814 -172.64073 -171.59555     0
70-80-20-30        -218.88876 -220.27776 -217.49977     0
40-50-30-40         -44.12835  -44.40843  -43.84826     0
50-60-30-40         -87.01520  -87.33664  -86.69377     0
60-70-30-40        -128.70667 -129.23815 -128.17520     0
70-80-30-40        -175.47730 -176.86966 -174.08494     0
50-60-40-50         -42.88686  -43.23415  -42.53956     0
60-70-40-50         -84.57833  -85.12583  -84.03082     0
70-80-40-50        -131.34895 -132.74751 -129.95039     0
60-70-50-60         -41.69147  -42.26124  -41.12170     0
70-80-50-60         -88.46210  -89.86952  -87.05467     0
70-80-60-70         -46.77063  -48.24032  -45.30094     0

What To Do?

Which Is The Appropriate Method?

Hopefully, we can see that these are all essentially identical.

We need to think about what exactly we are doing:

  • Are we predicting something?

  • Are we concerned about group differences?

  • Do we want to be limited?

  • Are we doing experimental work?

Rules Of Thumb

20 records per predictor…

  • Not happening here!

Power Analysis

Do you want to melt most people’s brains?

  • Don’t use rules of thumb!

  • Instead of trusting outdated advice, use actual science to determine how many people you need to find if a difference exists.

Power Analysis

We need to supply any three of the following four parameters (the remaining one is solved for):

  • Effect size

  • Sample size

  • Significance level

  • Power

We should always be doing this a priori

  • Sometimes, it is fun to be a “statistical coroner”

Power

Power is the ability to detect an effect when one exists.

  • In NHST terms, we are trying to determine how often we correctly reject a false null hypothesis.

  • Type I error: rejecting a true \(H_{o}\) (false positive)

  • Type II error: failing to reject a false \(H_{o}\) (false negative)

  • Which is more dangerous?

Putting It All Together

Let’s use the pwr package.

library(pwr)

pwr.f2.test(u = 1, v = NULL, f2 = .05, power = .8)

     Multiple regression power calculation 

              u = 1
              v = 156.9209
             f2 = 0.05
      sig.level = 0.05
          power = 0.8

In the function:

  • u is the numerator degrees of freedom (the number of predictors being tested)

  • v is the denominator degrees of freedom (n - u - 1)

  • f2 is the effect size (Cohen’s \(f^2\))

  • \(\Pi = 1 -\beta\)

  • \(\beta = Type\,II_{prob}\)

Power is typically set at .8, because it represents a 4 to 1 trade between Type II and Type I errors.

Your Turn!

Use various values to do an a priori power analyses.

How does the proposed sample size change as the number of predictors goes up?

What if you tweak the significance level?

What about power?
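A sketch of that exploration using pwr: since the required sample size is u + v + 1, we can tabulate it across Cohen’s conventional \(f^2\) effect sizes (.02, .15, .35 for small, medium, large):

```r
library(pwr)

# required n for power = .8 with one predictor, across effect sizes
sapply(c(.02, .15, .35), function(es) {
  v = pwr.f2.test(u = 1, f2 = es, sig.level = .05, power = .8)$v
  ceiling(v) + 1 + 1  # n = v + u + 1
})
```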

Different Test, Different Power Tests

We just did a test for a linear regression model.

Here is one for a t-test:

tPower = pwr.t.test(n = NULL, d = 0.1, power = 0.8, 
                    type= "two.sample", alternative = "greater")

plot(tPower)

Congratulations

  • Your knowledge now far exceeds that of most professionals!